Re: revised "generic syntax" and "data:" internet drafts

Edward Cherlin (cherlin@newbie.net)
Thu, 3 Apr 1997 21:35:52 -0800


Message-Id: <v03007805af69d830da5e@[206.245.192.34]>
In-Reply-To: <Pine.SUN.3.96.970403120236.245D-100000@enoshima>
Date: Thu, 3 Apr 1997 21:35:52 -0800
To: uri@bunyip.com
From: Edward Cherlin <cherlin@newbie.net>
Subject: Re: revised "generic syntax" and "data:" internet drafts

Martin Duerst wrote:
>On Wed, 2 Apr 1997, Larry Masinter wrote:
>
>> In my personal judgement, there was significant controversy
>> about adding to a Draft Standard document additional constraints
>> that were not part of the Proposed Standard and are not
>> implemented in at least two interoperable implementations.

What constraints? The proposal is to use %HH encoding of UTF-8 encoding of
Unicode in ASCII URLs as the standard way to handle those non-ASCII
character sets that have bidirectional mappings to Unicode, so that the
characters can be positively identified. But as written it would have no
effect on anyone using, say, ISO-2022. It does not forbid use of national
or other partial character sets standards, but encourages the use of an
unambiguous notation.
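The proposal described above can be sketched in a few lines of modern Python (not part of the original message; `urllib.parse` post-dates it, but it implements exactly this %HH-of-UTF-8 convention):

```python
# Sketch: %HH-encode the UTF-8 octets of a non-ASCII URL path segment.
# urllib.parse.quote encodes non-ASCII characters to UTF-8 by default.
from urllib.parse import quote, unquote

segment = "résumé"          # contains U+00E9 LATIN SMALL LETTER E WITH ACUTE
encoded = quote(segment)    # UTF-8 octets 0xC3 0xA9 become %C3%A9
print(encoded)              # -> r%C3%A9sum%C3%A9

# The mapping is reversible, so the characters are positively identified.
assert unquote(encoded) == segment
```

Because the escaped form is pure ASCII, it travels through existing URL machinery untouched, which is why the recommendation has no effect on software that ignores it.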

>In the current discussion, started by my original proposals in
>mid February, there was definitely no "significant controversy"
>about procedural matters such as those you mention above.
>If you think otherwise, please give the references to the
>mailing archive. As you can see below, there is no need for
>such a controversy. If you had brought this subject up earlier,
>I could have answered as below earlier.
>
>Also, there are in no way any additional constraints.
>There are only recommendations. I have clearly shown
>that these don't affect existing (or even future)
>implementations in any major way. If you want to challenge
>this, please give the details.
>
[snip]

>> >   URL creation mechanisms that generate the URL from a source which
>> >   is not restricted to a single character->octet encoding are
>> >   encouraged, but not required, to transition resource names toward
>> >   using UTF-8 exclusively.
>> >   URL creation mechanisms that generate the URL from a source which
>> >   is restricted to a single character->octet encoding should use UTF-8
>> >   exclusively.  If the source encoding is not UTF-8, then a mapping
>> >   between the source encoding and UTF-8 should be used.
>> >
>> This is an additional requirement that does not correspond,
>> as far as I can tell, to any kind of "implementation experience".
>> I know of no URL creation mechanisms that actually do this.
>
>See above. "implementation experience" is obviously trivial.

What sort of implementation? Does it have to be part of some Internet
application, or can I just write a little utility? Anyway, it's just code
page mapping and table lookup, and the tables are all provided along with
The Unicode Standard, Version 2.0 volume from Addison-Wesley, and are
available at the Unicode site. What's the big deal?
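The "code page mapping and table lookup" step really is this small. A minimal sketch in Python (an assumption for illustration, using ISO 8859-1 as the source encoding; the codec carries the lookup table the Unicode Consortium publishes):

```python
# Sketch: map octets in a legacy source encoding to Unicode via a
# published table, re-encode as UTF-8, then %HH-escape the result.
legacy_bytes = b"caf\xe9"                  # "café" in ISO 8859-1
text = legacy_bytes.decode("iso-8859-1")   # table lookup: octets -> code points
utf8_bytes = text.encode("utf-8")          # code points -> UTF-8 octets

# Naive %HH escaping of anything outside printable ASCII (a sketch only;
# a real implementation would also escape reserved URL characters).
escaped = "".join(
    chr(b) if 0x20 < b < 0x7F else "%%%02X" % b
    for b in utf8_bytes
)
print(escaped)  # -> caf%C3%A9
```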

>> Further, I think that the complaints that there is a certain
>> amount of ambiguity in practice over exactly how one goes
>> about doing this are legitimate, and that not only is there
>> no "running code", there is not "rough consensus".
>
>The code that we have is obviously very much sufficient.
>Rough consensus is there, the word "rough", as I have seen
>it interpreted in IETF working groups, takes care of the
>case of a single individual raising the same far-fetched
>and unrelated complaints over and over, in a rather short
>and cryptic manner, even after they have been addressed
>in detail.

Surely we don't accept the old Polish veto (yelling "NO" from the sidelines
without actually participating in the discussion)? Has anyone raised a real
technical objection, not just a "You're wrong" or "I don't like it"?

>I don't know exactly what you intend to refer to with
>"certain ambiguity". If you mean ambiguities arising from
>URLs such as http://0oO0Il1.com/IlIl10oO.html, this is
>obviously a problem that is ignored for ASCII, because
>of the correct assumption that URL generators learn to
>avoid such cases by trial and error if not otherwise.
>I do not think that at the present time, things beyond
>ASCII need to be specified more explicitly than ASCII
>itself, in this respect.

Can we try that again? We know what UTF-8 is, without any ambiguity. We
know how to put Unicode text into canonical form, without ambiguity. We
know how to do %HH encoding, without ambiguity. Where is the ambiguity?
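The two unambiguous steps named above can be shown concretely (a Python sketch, not part of the original message; NFC is the canonical composition form later specified in Unicode normalization):

```python
# Sketch: canonicalize the Unicode text, then %HH-encode its UTF-8 octets.
import unicodedata
from urllib.parse import quote

decomposed = "e\u0301"                        # 'e' + U+0301 COMBINING ACUTE ACCENT
canonical = unicodedata.normalize("NFC", decomposed)
assert canonical == "\u00e9"                  # one precomposed character, U+00E9

print(quote(canonical))                       # -> %C3%A9, always the same octets
```

Given a fixed canonical form, equal strings always yield equal escape sequences, which is the sense in which there is no ambiguity.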

Are we talking about text such as

Latin Capital Letter A            U+0041
Greek Capital Letter Alpha        U+0391
Cyrillic Capital Letter A         U+0410

which are visually indistinguishable from 'AAA'? This may be a bit of a
problem for systems that try to display the characters properly, but only
if they don't provide access to the %HH-encoded original UR* or to the
numeric values of the characters, and don't provide any indication of the
script of the characters when selected. This problem is outside the scope
of this group, which is concerned with correct rendering, transmission, and
interpretation of UR*s in software and in print, not their presentation to
the user by software.
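A quick Python illustration (not part of the original message) of why the %HH-encoded original disambiguates the three look-alike capitals listed above:

```python
# The three capitals are visually identical in many fonts, but their
# code points differ, so their %HH-encoded UTF-8 forms are distinct.
from urllib.parse import quote

lookalikes = {"Latin": "\u0041", "Greek": "\u0391", "Cyrillic": "\u0410"}
for script, ch in lookalikes.items():
    print(script, quote(ch))
# Latin A           (ASCII 'A' needs no escaping)
# Greek %CE%91
# Cyrillic %D0%90
```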

>I very well acknowledge that for some cases, some more
>detailed specifications are highly desirable. I have
>talked with many people about the issues involved, and
>I have repeatedly volunteered to work on the necessary
>documents. However, I do not see any sense in writing
>such documents in the void, without a clear commitment
>for a good solution in the central document. Actually,
>I would like nothing more than finishing the current
>controversy on the base issue and having some time to
>work on more documentation. I therefore sincerely hope
>that we can stop useless "procedural concerns" as above
>as quickly as possible. [Also, as long as we are only
>concerned with %HH (this is the only thing that should go
>into the current draft, I agree that the transition to
>using "native" URLs is something more experimental, and
>that the necessary documents for it will have to be written),
>the potential ambiguities actually don't arise :-].

Yes, that all comes later. However, it will turn out to be much easier when
we finally get there. This all reminds me of the flailing about in Europe
trying to establish a European currency. If they just did it without trying
at the same time to keep all of their old currencies, it would all turn out
to be much easier, just as in the U.S. 200 years ago when the Federal
currency replaced the state currencies.

I realize that we cannot do this in the case of Unicode URLs, and I am not
suggesting that we try. We will have to provide backward compatibility as
far as possible. I merely claim that pure Unicode URLs will turn out to be
far simpler than the current set of kluges. As far as I am concerned,
Martin's proposal is a model of simplicity and clarity, meets a real need,
and need not bother anyone who isn't interested.

>
>> > I'm surprised, too. I thought we had this worked out, and that
>> > there was no significant objection or controversy.
>>
>> I hope that the domain name from which you post ("newbie.net")
>> isn't some kind of joke. If you insist, I will forward you
>> the three hundred or so email messages discussing the controversy
>> around the proposed additions.

Ad hominem, is it? Quite unworthy of you. After all, you could have looked.
Very well, I will tell you--undoubtedly more than you want to know, but you
brought it up.

NewbieNet <http://www.newbie.net> is a three-year-old information service
for new Internet users offering the NewbieNewz mailing list and several
Web-based courses and information resources (New Newbie Pages, CyberCourse,
Netiquette course, Unofficial Smiley FAQ, Frames Tutorial, and more). I
have been using and writing about computers for 20 years, starting with
timesharing (Amdahl 470) and the Commodore 64, Radio Shack TRS-80, CP/M,
and Apple II, in APL, FORTH, BASIC and several assemblers.

My involvement with multilingual software issues goes back to my experience
with mixed Korean/English typesetting in the Peace Corps in 1967, and
includes typesetting an APL magazine in every technology from manual
pasteup of daisywheel printouts to Aldus Pagemaker and PostScript APL
fonts. I now review multilingual software, with emphasis on Unicode
support, for Multi-Lingual Computing magazine.

In 1986 I organized and led a project to create a fully portable ISO/ANSI
standard APL interpreter called I-APL which could run on any 8-bit or
better computer and did run on the Apple II, Commodore 64, BBC Micro, and
others, plus PC, Mac, and UNIX. It could be made to display in most writing
systems. We put out versions of I-APL in English, French, German, Finnish,
Russian, and Japanese (no Kanji support, of course). I participated in the
ISO and ANSI APL standards development process, and the associated effort
to get APL characters included in ISO 10646 and Unicode.  I also have
experience in typesetting math and music.

I have written market research reports on Non-Latin fonts (1991) and the
impact of Unicode (1994). Unicode is becoming a standard feature of many
present and most promised operating systems, programming languages, Web
browsers, E-mail software, industrial-strength databases, and office suites
from Microsoft, Lotus, and Corel. In particular, the character type in Java
is defined to be a 16-bit Unicode character. Many people, like myself,
depend heavily on Unicode for Web browsing and for publication. I read
French, German, Hebrew, Yiddish, Russian, Chinese, Korean, and Japanese in
varying degrees, and would like to be able to view all of the others
correctly.

I would use Unicode mail for preference if my correspondents could all
receive it correctly, but that will take a few years. At present we are
preparing to use Unicode mail on the Unicode mailing list as an experiment.
I am particularly keen on getting Unicode accepted as the standard
character set for everything to do with the Internet. It won't be a true
World-Wide Web until everyone can publish in their own languages and
writing systems so that everyone else can see it properly. Until Unicode is
the standard character set, there will be no standard for creating and
viewing multilingual documents, or even single language, single script
documents in anything other than ASCII.

>I guess there is no need to do that. Edward is very well aware
>of the discussion that went on. Some of the best contributions
>to it are from him. He probably followed the discussion more
>closely than many others. Threatening him with mail flooding
>is beyond what I want to comment about.
>
>
>Regards,	Martin.

Yeah, me too.

--
Edward Cherlin     cherlin@newbie.net     Everything should be made
Vice President     Ask. Someone knows.       as simple as possible,
NewbieNet, Inc.                                 __but no simpler__.
http://www.newbie.net/                Attributed to Albert Einstein