Re: revised "generic syntax" internet draft from Martin J. Duerst on 1997-04-21 (uri@w3.org from April 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Mon, 21 Apr 1997 17:46:36 +0200 (MET DST)
To: Larry Masinter <masinter@parc.xerox.com>
Cc: uri@bunyip.com, fielding@kiwi.ics.uci.edu, Harald.T.Alvestrand@uninett.no
Message-Id: <Pine.SUN.3.96.970421153537.245K-100000@enoshima>
On Mon, 21 Apr 1997, Larry Masinter wrote:

> > Well, they print something like http://WEB.SANYO.CO.JP/FOODSHOP,
> > where upper case is Japanese characters. 
> 
> Actually, this is unsatisfactory. What, exactly, would they
> print? Would they print "http://" too? Will Japanese users
> find that familiar and comfortable? I'm afraid "something
> like" isn't useful as a specification.

I put in the "something like" because I don't assume you, or
everybody else in the group, could view or read actual
Japanese characters. As for the "http://", I guess neither
Japanese nor US users find that very comfortable. And on
many browsers, you can just leave it out.


> >                     Of course, for this we have
> > to assume that DNS works with characters beyond ASCII, but that's
> > a separate problem that can be solved (see draft-duerst-dns-i18n-00.txt).
> 
> I fundamentally disagree with your idea that we can
> promote the solution to a problem in pieces, where the
> pieces, just by themselves, don't actually solve a
> problem and, in fact, introduce interoperability
> difficulty. So I'm unwilling to "assume" that other
> pieces of the solution will be introduced in order
> to make a whole.

We don't have to "assume". Indeed we don't know whether the DNS
guys will be convinced to adopt a solution for i18n URLs, and
how this solution will look. And as long as that work hasn't
been done, we can't use beyond-ASCII characters in domain names,
and so we won't.
But what if they decided to work on a solution and to
deploy it, and then had to conclude "well, we can't, because
character encoding in URLs is not working"? The basic problem
currently is that URLs have an extremely well-defined way to
represent ASCII characters, but are useless when it comes to
representing any other characters.
The IMAP case has very well shown this. The IMAP protocol
has a very specific way of encoding beyond-ASCII characters
(called "modified UTF-7"). They would very much like to know
how such characters are encoded in URLs, so that they can
have the respective URLs and can tell implementors what to
do to convert from URLs to IMAP protocol calls.
When they turn to the URL definition (as it stands with
the current draft), the only answer they get is "it's
undefined". What this translates to is "it's a mess, and
we didn't care to solve it".


> > This is entered as such into a browser. We assume that those users
> > that are the target of the Sanyoo depaato food shop page can read
> > Japanese and have equipment that allows them to input Japanese.
> > I won't go into the details of entering the corresponding characters,
> > it's a process the Japanese computer users are very familiar with.
> 
> No, I'm sorry, this is completely inadequate. I'm vaguely familiar
> with a number of Japanese typing methods, and I believe
> that you've not been specific enough. What happens with the
> codes for "http://", for example, since these are not 'Japanese
> characters'? What about unusual names which seem to be printed
> with furigana in Japanese newspapers?

Short answer: Japanese computer users know how to enter Japanese.
For URLs, we can safely assume this, and don't have to make it
part of the spec. The layout of the QWERTY keyboard also is not
part of the URL spec, and if it were, it wouldn't help me, because
I'm using a different keyboard layout.

Long answer: Some solutions for "http://" have been discussed
above. If the user really wants to type these letters, Japanese
input methods have a way to change to "half-width English letters"
input, as they call it. This method can be a menu point, a button
on a floating window, a keyboard shortcut, or usually several
of them. I hope you don't ask me how a Japanese user is selecting
a menu point or pushing a button, or what exactly the shortcuts are.
For general syntax characters such as "/", ".", ":", and so on,
they are usually accessible even when we are in Japanese input
mode. In some configurations, they produce the "wide" variant
of these letters, but that would have to be set differently
for the URL input field. Careful UI design at work and a
"quality of implementation" issue, nothing for an internet
standard to worry about.
Unusual names and such are printed in newspapers with kana
following in parentheses, only magazines print them as furigana.
Because most input methods are based on phonetic input, the
furigana actually help to input the characters (as long as
the input method dictionary has the name in question available).
If not, there are other ways to enter the characters. In many
cases, the constituent characters of the name are very basic
and familiar characters, so inputting them one-by-one is
very easy. The names Igarashi or Hasegawa may serve as
examples. If that also fails, one usually has other
ways to look up and input a character, for example by
radical or by actual code number. There are/were at least
25 different versions out on the market of what is called
"Wordprocessor Dictionary" that allow one to find characters
with traditional lookup methods and find their JIS codes.
I hope we don't have to go into the details of traditional
character dictionary lookup to complete our work on
URL internationalization :-).
Anyway, people interested in users being able to easily
input the URLs they have published would most probably
avoid such difficult cases wherever possible.

Conclusion: We are discussing URL internationalization
here. URLs lack a consistent/relevant mechanism for encoding
beyond-ASCII characters, but there is a proposal to remedy
this, which should be adopted. In the process of adopting this
proposal, we can safely assume that the Japanese have done
whatever they feel is necessary and possible to handle their
writings on the computer, and that they know how to do this,
and that even if some aspects of it might look extremely
strange and clumsy and unnecessary and whatever to us,
it's not our job to judge this, or to try to convert
them to use ASCII only and English. If they had
seen an advantage in doing so, they would have done so
100 years ago or 50 years ago.

We need to solve the URL problem, not the Japanese input problem!


> > The browser then would convert the Japanese characters into UTF-8
> > and (add %HH encoding) and pass the URL to the resolver machinery,
> > where the host part would be resolved with DNS, and then the machine
> > at the corresponding IP number would be contacted with HTTP.
> 
> This discussion applies only to HTTP URLs, though. You're
> proposing that the recommendation be put into place for
> all existing URL schemes and new versions of them, too.

Yes, of course. Let's have a look at them (as of RFC 1738):

ftp:	Is transitioning towards UTF-8 all by itself. No problem
	here.

mailto: Does not allow beyond-ASCII characters. If it ever
	comes to going i18n, it will be happy to know that
	it can rely on beyond-ASCII characters being safely
	encoded in URLs.

telnet: Nothing much to be internationalized except for the domain name.

file:	Haven't seen that in a newspaper or magazine, for obvious
	reasons. Is local, and so the conversion to and from
	UTF-8 can very well be short-circuited. If you want
	to include a comment to that effect, nobody will mind.
	That's what the "recommendation" is for (among other things).

gopher:	I am not very familiar with this. From RFC 1738, it becomes
	apparent that Gopher+ allows for the equivalent of
	"Accept-Language:", but it is not clear how beyond-ASCII
	characters get encoded. From reference [2] in RFC 1738,
	it becomes evident that gopher+ is about as ignorant of
	the issues of internationalization as some other protocols,
	it just says that "undisplayable characters should be avoided,
	but if necessary, ISO-Latin1 should be used" :-(.

news:	Newsgroup names and message IDs are currently ASCII only.
	So the same comments as for "mailto" apply.

nntp:	Same as for "news:".

wais:	Wais is based on Z39.50, which as far as I know has a
	background in libraries and the like, which in turn
	has a strong track record of solving character encoding
	problems upfront. Reference [7] in RFC 1738 is not
	available anymore. If you know where the currently
	relevant spec is, please give me a hint.

prospero: The protocol spec says things such as that HSOnames
	(host specific object names) are defined to be ASCII only,
	but that <hsoname-type> exists because on some file systems,
	filenames might not be ASCII only.

The conclusion is: some protocols don't currently need i18n of
URLs, but UTF-8 will come in handy once they do. Some protocols
already are going the right way. Some protocols are badly designed
(solving things for ASCII and ignoring the rest) and it may be
difficult to improve them. If a protocol has some explicit reason
for why it can't use UTF-8, or for why it can't transition the
current URLs, then that's its own problem of bad design.
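
To make the conversion concrete, here is a minimal sketch in
Python (an illustration only, not part of the draft). It encodes
a path as UTF-8 and escapes the resulting bytes as %HH. The
function names and the example path are hypothetical.

```python
from urllib.parse import quote, unquote


def to_url_path(path: str) -> str:
    """Encode a path as UTF-8, then escape the bytes as %HH."""
    # quote() encodes str input as UTF-8; "/" is kept as a
    # path separator, everything beyond ASCII becomes %HH.
    return quote(path, safe="/", encoding="utf-8")


def from_url_path(escaped: str) -> str:
    """Undo %HH escaping and interpret the bytes as UTF-8."""
    return unquote(escaped, encoding="utf-8", errors="strict")


# Hypothetical example path: two kanji, U+98DF U+54C1.
path = "/\u98df\u54c1"
print(to_url_path(path))  # -> /%E9%A3%9F%E5%93%81
```

The result is pure ASCII, so it survives any transport that
handles today's URLs, and decoding it gives back the original
characters.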


> > That
> > machine would of course have been set up so that the correct page
> > is returned.
> 
> How, please, is the machine set up? What has to be done at
> the server & system administration level?

Well, for the domain name, that depends on the solution that
will be taken for that problem. For the path/file component
in the HTTP case, some solutions have already been shown,
and others have been discussed in detail. I don't think that
I need to rehash them here.


> What's the transition
> strategy for a server that wants to serve current clients
> as well as these new browsers that can deal with the proposal
> you're promoting?

Well, the current clients may fall in two categories. The first
is one that doesn't handle kanji at all. Then we are out of
luck, but that may not be too bad, as we couldn't view the
retrieved document anyway. The second category will try to
send the characters as some natively encoded URL. For the
case above (sparsely populated namespace), heuristically
guided trial-and-error on the server side will cover even
cases where there is more than one frequently used native
encoding. But maybe this is too much work. Then for a
transitory period, we can just publish two URLs, one
with full kanji and the other with an ASCII alternative
or with the same thing with %HH. Or some third party can
offer a service for old browsers (e.g. a form where
you can type in the kanji as you see them, with the
traditional method (Roy's 2c+tricks), do the conversion
to UTF-8, and then do a redirect). The web is extremely
flexible, one just has to be ready to use this flexibility
before claiming that things "won't work".
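
The server-side trial-and-error mentioned above could look
roughly like the following sketch. It assumes Python, a small
set of known paths (the sparsely populated namespace), and a
hypothetical candidate list of UTF-8, Shift_JIS and EUC-JP,
tried in that order. The names resolve, CANDIDATE_ENCODINGS
and known are made up for this illustration.

```python
from typing import Optional, Set
from urllib.parse import quote, unquote_to_bytes

# Assumed order: the proposed encoding first, then legacy ones.
CANDIDATE_ENCODINGS = ("utf-8", "shift_jis", "euc_jp")


def resolve(raw_path: str, known_paths: Set[str]) -> Optional[str]:
    """Guess the encoding of a request path by trial and error."""
    raw = unquote_to_bytes(raw_path)  # undo %HH, keep raw bytes
    for enc in CANDIDATE_ENCODINGS:
        try:
            candidate = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding
        # Accept a decoding only if it names an existing resource;
        # with few resources, a wrong guess rarely matches.
        if candidate in known_paths:
            return candidate
    return None


known = {"/\u98df\u54c1"}  # hypothetical resource
# An old client sending the path in Shift_JIS (shown %HH-escaped).
legacy = quote("/\u98df\u54c1".encode("shift_jis"))
print(resolve(legacy, known) == "/\u98df\u54c1")
```

The check against the known paths is what makes the guessing
safe: a byte sequence that happens to decode in the wrong
encoding is simply rejected because it names nothing.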

Anyway, I would definitely advise people to wait a little
bit (one browser generation or so) before they start to
put out URLs with native characters.



> > I hope this explanation is detailed enough. If you don't understand
> > some part of it, please tell us.
> 
> As you see, it was inadequate for the purposes of
> being a stand-in for 'running code': there are 
> a number of unresolved design issues in your plan,
> those design issues must be resolved before interoperable
> implementations can be deployed, and I'm uncertain
> as to whether the results, when taken in total,
> actually solve the problem you set out to solve,
> or even improve the situation significantly. And
> given the difficult transition strategy and lack
> of interoperability with currently deployed systems,
> I doubt that a proposal will actually be adopted
> unless that's so.

I am very sure that the general design is well-developed,
and that the web is flexible enough to have solutions
available for the transition.


Regards,	Martin.
Received on Monday, 21 April 1997 11:48:35 UTC