Re: revised "generic syntax" internet draft

Edward Cherlin (cherlin@newbie.net)
Thu, 24 Apr 1997 00:09:11 -0700


Message-Id: <v0300780baf8466ef2424@[206.245.192.47]>
In-Reply-To: <335B6488.1BD1@parc.xerox.com>
To: uri@bunyip.com

Larry Masinter <masinter@parc.xerox.com> wrote:

>> Well, they print something like http://WEB.SANYO.CO.JP/FOODSHOP,
>> where upper case is Japanese characters.
>Actually, this is unsatisfactory. What, exactly, would they
>print? Would they print "http://" too? Will Japanese users
>find that familiar and comfortable? I'm afraid "something
>like" isn't useful as a specification.

OK, for a hypothetical Isshin Depaato/Produce (because I don't have the
Kanji for Sanyoo handy) they use three different switchable input methods
(ASCII [romaji] keyboard; Kana to kanji or, for foreigners, romaji to kanji
conversion; kana keyboard) to enter a text which we can represent with
ASCII and Unicode as:

h
t
t
p
:
/
/
U+4E00 I     [kanji] one
U+5FC3 Shin  [kanji] heart
U+30C7 de    [katakana]
U+30D1 pa    [katakana]
U+30FC --    [katakana] vowel extension
U+30C8 to    [katakana]
/
U+516B ya    [kanji] 8
U+767E o     [kanji] hundred
U+5C4B ya    [kanji] shop
/

Actually the software will most likely use one of the double-byte Japanese
codes that includes places for the ASCII characters, and we are supposing
that this will be translated to pure Unicode, then to UTF-8, and then to %HH.
Now I am not giving the full set of encodings that this URL goes through,
because this translation is algorithmic. You can look up the Unicode for
the ASCII characters yourself, if you insist on having them.
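
For anyone who would rather see the algorithmic part written out, here is a
minimal sketch in Python of the Unicode to UTF-8 to %HH steps for the example
above. The helper name escape_utf8 is made up for this illustration, and "de"
is taken as U+30C7 (KATAKANA LETTER DE). Escaping the host part this way is
only the illustration's assumption; how DNS would carry such a name is the
separate problem already mentioned.

```python
from urllib.parse import quote

# Code points for the hypothetical example; "de" is U+30C7 (KATAKANA LETTER DE).
host = "\u4e00\u5fc3\u30c7\u30d1\u30fc\u30c8"   # I-shin de-pa-(long vowel)-to
path = "\u516b\u767e\u5c4b"                      # ya-o-ya


def escape_utf8(text: str) -> str:
    """Encode text as UTF-8, then %HH-escape every byte that is not an
    unreserved ASCII character (hypothetical helper for this sketch)."""
    return quote(text.encode("utf-8"), safe="")


# The ASCII parts ("http://" and the slashes) are passed through untouched.
url = "http://" + escape_utf8(host) + "/" + escape_utf8(path) + "/"
print(url)
```

The result is plain ASCII, so it can travel through anything that handles
today's URLs.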

If absolutely necessary, I can specify the keystrokes and screen display
for the whole process, using the multilingual tools available to me. Since
they are not the same as the tools used in Japan, I don't think this would
add anything to the demonstration.

If anyone objects that this is still not definite enough, I invite them to
test out a Mac with a Kanji OS, or a PC with Kanji Windows, such as would
appear in the Japanese market. The details differ, but the process of
entering text is basically the same--multiple input methods which together
handle ASCII, symbol and dingbat fonts, Kanji, and both types of kana.

Japanese schoolchildren are taught all four scripts (romaji, kanji,
hiragana, and katakana). Japanese word processing software supports all
four. Books, magazines, and newspapers in Japan routinely mix them. Even
street, train, and subway signs in Tokyo and some other cities mix them.

>
>>                     Of course, for this we have
>> to assume that DNS works with characters beyond ASCII, but that's
>> a separate problem that can be solved (see draft-duerst-dns-i18n-00.txt).
>
>I fundamentally disagree with your idea that we can
>promote the solution to a problem in pieces, where the
>pieces, just by themselves, don't actually solve a
>problem and, in fact, introduce interoperability
>difficulty. So I'm unwilling to "assume" that other
>pieces of the solution will be introduced in order
>to make a whole.

As far as I have heard in this discussion, the only places where this
proposal would cause interoperability problems already have insoluble
problems without it. Do we object so strongly to breaking usage that
violates the existing standards and is already broken?

>> This is entered as such into a browser. We assume that those users
>> that are the target of the Sanyoo depaato food shop page can read
>> Japanese and have equipment that allows them to input Japanese.
>> I won't go into the details of entering the corresponding characters,
>> it's a process the Japanese computer users are very familiar with.
>
>No, I'm sorry, this is completely inadequate. I'm vaguely familiar
>with a number of Japanese typing methods, and I believe
>that you've not been specific enough. What happens with the
>codes for "http://", for example, since these are not 'Japanese
>characters'? What about unusual names which seem to be printed
>with furigana in Japanese newspapers?

I see you know enough about Japanese writing to be dangerous, but not
enough to be helpful. :-) I repeat: The ASCII characters are typed on a
romaji keyboard layout, which students must learn in school. The
double-byte Japanese character codes include ASCII.

Unusual names using characters not in the standard code sets cannot be used
in URLs either in current non-standard practice or in our proposal based on
Unicode. This is not an issue, since no page designer or Webmaster would
try to use characters that cannot be represented in the computer.

Names using characters in the code set that require a pronunciation guide
when printed are a little harder to enter than more common characters, but
they can be entered using kana conversion if the pronunciation is in the
conversion dictionary, or by radical/stroke count/menu selection or one of
the other indexing methods otherwise. Perhaps the code point number would
have to be provided with such a character when used in a URL, alongside
the pronunciation.

>> The browser then would convert the Japanese characters into UTF-8
>> and (add %HH encoding) and pass the URL to the resolver machinery,
>> where the host part would be resolved with DNS, and then the machine
>> at the corresponding IP number would be contacted with HTTP.
>
>
>This discussion applies only to HTTP URLs, though. You're
>proposing that the recommendation be put into place for
>all existing URL schemes and new versions of them, too.

This is merely an illustration. It is obvious that we would have to do the
same process to support ftp:, gopher:, telnet: and other URLs using
non-ASCII Unicode characters. Again, the DNS would need the ASCII-encoded
domain name, and the server at that site would deal with the rest.

I have said at least ten times in this discussion, with no acknowledgement
from anyone, that we are to assume that people will not publish Unicode
URLs without knowing that their servers support them.

If I am going to create an ftp: site, and I don't check what version of
what ftp server I'm using, I'm a fool, and likewise for gopher: and telnet:
and the others. If I put out an https: URL and I don't have a secure server
to receive it, I'm a fool. If I intend to accept encoded UTF-8, I need to
find out how my server can deal with it. If I don't intend to accept it, I
can regard encoded UTF-8 in URLs as plain ASCII, without breaking any
process that is not already broken.

>> That
>> machine would of course have been set up so that the correct page
>> is returned.
>
>How, please, is the machine set up? What has to be done at
>the server & system administration level? What's the transition
>strategy for a server that wants to serve current clients
>as well as these new browsers that can deal with the proposal
>you're promoting?

Strategy 1. Use the ASCII-encoded URL as is to find the requested page
(i.e. use the encoded directory and file names, or whatever), or pass
encoded data to the application for processing, including determining the
need for character set conversion. In other words, make no changes to the
server whatsoever, and still have full support for UTF-8 URLs.

Strategy 2. Put in as many bells and whistles as you like. Let the server
convert, where possible, from ASCII to UTF-8 to Unicode to the local
character set. Let the server convert other %HH-encoded data to whatever
data format the page designer requests.

Neither of these strategies will work as well as we want in all cases. The
cases that fail, however, are broken today. We are not introducing new
points of failure. The full solution will require detailed design in order
to succeed in all cases, but the interim solution only needs to permit
success *for those motivated to get the software they need*.
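
The two strategies can be sketched roughly as follows. This is only an
illustration under stated assumptions, not a patch for any real server: the
PAGES table, the function names, and the choice of Shift_JIS as the local
character set are all made up for the example.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes

# Hypothetical document table, keyed by the escaped path exactly as received.
PAGES = {"/%E5%85%AB%E7%99%BE%E5%B1%8B/": "greengrocer page"}


def strategy_1(request_path: str) -> Optional[str]:
    """Strategy 1: treat the escaped path as opaque ASCII and look it up
    as is. Nothing is decoded, so nothing in the server changes."""
    return PAGES.get(request_path)


def strategy_2(request_path: str,
               local_charset: str = "shift_jis") -> Optional[bytes]:
    """Strategy 2: undo the %HH escapes, read the bytes as UTF-8, and
    convert to the local character set. Returns None where a conversion
    step is not possible, leaving the caller to fall back to strategy 1."""
    raw = unquote_to_bytes(request_path)
    try:
        return raw.decode("utf-8").encode(local_charset)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return None
```

A request whose escaped bytes are not valid UTF-8 makes strategy_2 return
None, which is the "already broken today" case: the server can still hand
the untouched ASCII string to strategy_1.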

>> I hope this explanation is detailed enough. If you don't understand
>> some part of it, please tell us.
>
>As you see, it was inadequate for the purposes of
>being a stand-in for 'running code': there are
>a number of unresolved design issues in your plan,
>those design issues must be resolved before interoperable
>implementations can be deployed, and I'm uncertain
>as to whether the results, when taken in total,
>actually solve the problem you set out to solve,
>or even improve the situation significantly. And
>given the difficult transition strategy and lack
>of interoperability with currently deployed systems,
>I doubt that a proposal will actually be adopted
>unless that's so.

Problems to be solved:

Provide for URLs which can be (but are not required to be) displayed in a
non-Latin or extended Latin script meaningful to the user.

Provide for receipt of non-ASCII text data in URLs, including multilingual,
multiscript form input.

We have proposed an ASCII encoding of Unicode to be recommended, but not
required, in both of these cases. We have shown how such URLs can be
generated using simple filter programs. We have demonstrated the use of
such URLs in existing browsers. We have heard from major implementors that
this is their intended future direction. We have heard from users that they
need this capability now. We have heard that numerous other standards will
incorporate Unicode, and provide for the necessary character set encodings
and conversions.
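
The "can be, but are not required to be, displayed" half of this might look
something like the sketch below on the client side. The function name
display_form is hypothetical, and the test "is it valid UTF-8?" is only a
heuristic assumed for this example, since bytes in some other encoding can
occasionally pass it by accident.

```python
from urllib.parse import unquote_to_bytes


def display_form(escaped: str) -> str:
    """Return the native-script form of an escaped URL component, or the
    escaped ASCII unchanged when its bytes are not valid UTF-8."""
    try:
        return unquote_to_bytes(escaped).decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: show exactly what was received, as browsers do today.
        return escaped


print(display_form("%E5%85%AB%E7%99%BE%E5%B1%8B"))  # three kanji: ya-o-ya
print(display_form("%B0%CC"))                       # left as "%B0%CC"
```

Either way the URL that goes over the wire is the same ASCII string; only
what the user sees differs.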

On the other hand, it may not matter much whether this gets into this
standard. We have begun the process of implementation, and those of us who
want to do it are agreed that this very simple and obviously incomplete
proposal solves many of our problems, so we will go do it no matter what
anybody else says. Then we will come back and ask for a detailed standard
based on what we have done.

>That's why "something like" is inadequate above.
>If someone had running code, they could just run it
>and show us what the results were.
>
>Regards,
>
>Larry
>--
>http://www.parc.xerox.com/masinter

Very well, we have had several people post patches for servers. Can we have
the servers themselves set up somewhere, with a selection of pages in
various scripts and with assorted URL formats? We'll want at least Japanese
Kanji and kana URLs, and a multilingual form of some sort.

"I'll bet you can't turn *that* into a pumpkin."--Witches Abroad, by Terry
Pratchett

--
Edward Cherlin     cherlin@newbie.net     Everything should be made
Vice President     Ask. Someone knows.       as simple as possible,
NewbieNet, Inc.                                 __but no simpler__.
http://www.newbie.net/                Attributed to Albert Einstein